Starting a bit late, but very hopeful! I am really interested in the topic, but technical difficulties are driving me insane. Also very lost for now.
Read the data
learning2014 <- read.csv("./data/learning2014.csv")
For data structure:
dim(learning2014)
## [1] 166 8
str(learning2014)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
## X gender age attitude deep stra surf points
## 1 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 6 F 38 3.8 4.750000 3.625 2.416667 21
Explore the dataset
summary(learning2014)
## X gender age attitude deep
## Min. : 1.00 F:110 Min. :17.00 Min. :1.400 Min. :1.583
## 1st Qu.: 42.25 M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Median : 83.50 Median :22.00 Median :3.200 Median :3.667
## Mean : 83.50 Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:124.75 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :166.00 Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
library(dplyr)
Select the columns of the ‘human’ data we want to keep:
keep <- c("Country", "Edu2.FM", "Labo.FM", "Life.Exp", "Edu.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")
human <- select(human, one_of(keep))
In R, NA stands for not available, which means that the data point is missing. If a variable you wish to analyse contains missing values, there are usually two main options:
1. Remove the observations with missing values
2. Replace the missing values with actual values using an imputation technique.
Option 1:
# print out a completeness indicator of the 'human' data
complete.cases(human)
# print out the data along with a completeness indicator as the last column
data.frame(human[-1], comp = complete.cases(human))
# filter out all rows with NA values
human_ <- filter(human, complete.cases(human))
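Option 2 is not used here, but a minimal mean-imputation sketch (on a small hypothetical data frame df with a numeric column x) could look like this:

```r
# hypothetical example: replace NAs in a numeric column with the column mean
df <- data.frame(x = c(1, 2, NA, 4))
df$x[is.na(df$x)] <- mean(df$x, na.rm = TRUE)
df$x  # the NA becomes (1 + 2 + 4) / 3 = 2.33...
```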
# look at the last 10 observations of human
tail(human, 10)
# define the last index we want to keep
last <- nrow(human) - 7
# keep everything except the last 7 observations
human_ <- human[1:last, ]
# add countries as rownames
rownames(human_) <- human_$Country
library(GGally)
## Loading required package: ggplot2
library(ggplot2)
pairs(learning2014[-1], col = learning2014$gender)
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
Before starting any analysis, it is good to look at the probability plots of each variable.
Options:
1. The qqnorm() function creates a quantile-quantile plot evaluating the fit of sample data to the normal distribution.
2. The fitdistr() function in the MASS package provides maximum-likelihood fitting of univariate distributions. The format is fitdistr(x, densfun), where x is the sample data and densfun is one of the following: “beta”, “cauchy”, “chi-squared”, “exponential”, “f”, “gamma”, “geometric”, “log-normal”, “lognormal”, “logistic”, “negative binomial”, “normal”, “Poisson”, “t” or “weibull”.
3. For other distributions, see: https://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
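As a quick sketch of options 1 and 2 on simulated data (the sample x here is hypothetical, drawn from a known normal distribution):

```r
# simulate a sample and inspect its fit to the normal distribution
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)

# option 1: quantile-quantile plot
qqnorm(x)  # points should fall roughly on a straight line
qqline(x)

# option 2: maximum-likelihood fit with MASS
library(MASS)
fitdistr(x, "normal")  # estimates of mean and sd, close to 5 and 2
```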
If the distribution does not satisfy the normality criterion, log-transform the variable to identify outliers more clearly.
Delete those observations.
Try step 1 again.
library(ggplot2)
Create a dummy variable with the same mean and standard deviation, but normally distributed, for comparison purposes:
absences_norm <- rnorm(200, mean = mean(absences, na.rm = TRUE), sd = sd(absences, na.rm = TRUE))
Plot the two variables side by side
boxplot(absences, absences_norm, main = "Boxplots of the absences variable: actual distribution vs. normal", at = c(1,2), names = c("absences","absences_norm"), las = 2, col = c("pink","yellow"), horizontal = TRUE)
library(tidyverse)
library(MASS)
Calculate the correlation matrix and round it (tidyverse and corrplot packages)
cor_matrix <- cor(Boston)
cor_matrix_r <- cor_matrix %>% round(2)
# round to 2 decimal places
Visualize the correlation matrix
corrplot(cor_matrix_r, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
pairs() command (or ggpairs() with the GGally package)
With pairs() you can reduce the number of pairs to see the plots more clearly. After the data argument, specify the columns you want to see, e.g. pairs(Boston[6:10], col = km$cluster).
scale()
boston_scaled <- scale(Boston) # mean = 0, centres the data around zero
class(boston_scaled) # scaling returns a matrix
boston_scaled <- as.data.frame(boston_scaled) # change the matrix into a data frame
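What scale() does is subtract each column's mean and divide by its standard deviation; a quick toy check:

```r
# scale() computes (x - mean(x)) / sd(x) column by column
x <- c(1, 2, 3)
manual <- (x - mean(x)) / sd(x)        # -1 0 1
all.equal(as.vector(scale(x)), manual) # TRUE
```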
Quantiles
1. Create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
Create a categorical variable ‘crime’
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low","med_low","med_high","high"))
Look at the table of the new factor crime
table(crime)
Remove the original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
Add the new categorical variable to the scaled data
boston_scaled <- data.frame(boston_scaled, crime)
We are now interested in exploring whether we can explain the student exam grade by relating it to other available variables. A multiple regression model is fitted with the response variable ‘points’ and explanatory variables ‘attitude’, ‘stra’ and ‘surf’.
lm <- lm(points ~ attitude + stra + surf, data = learning2014)
lm
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Coefficients:
## (Intercept) attitude stra surf
## 11.0171 3.3952 0.8531 -0.5861
summary(lm)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The model with three explanatory variables accounts for about 21% of the variance (multiple R-squared). The only variable with a statistically significant relationship with the exam grade is ‘attitude’. Next, we remove the two other variables one by one in search of a more parsimonious model.
lm1 <- lm(points ~ attitude + stra, data = learning2014)
lm1
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Coefficients:
## (Intercept) attitude stra
## 8.9729 3.4658 0.9137
summary(lm1)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
lm2 <- lm(points ~ attitude + surf, data = learning2014)
lm2
##
## Call:
## lm(formula = points ~ attitude + surf, data = learning2014)
##
## Coefficients:
## (Intercept) attitude surf
## 14.120 3.426 -0.779
summary(lm2)
##
## Call:
## lm(formula = points ~ attitude + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.277 -3.236 0.386 3.977 10.642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.1196 3.1271 4.515 1.21e-05 ***
## attitude 3.4264 0.5764 5.944 1.63e-08 ***
## surf -0.7790 0.7956 -0.979 0.329
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 163 degrees of freedom
## Multiple R-squared: 0.1953, Adjusted R-squared: 0.1854
## F-statistic: 19.78 on 2 and 163 DF, p-value: 2.041e-08
Taking out the ‘surf’ or ‘stra’ variables barely changed the multiple R-squared and thus the fit of the model. It therefore seems safe to conclude that they do not affect the exam grade, and we can proceed with the only significant predictor, ‘attitude’.
lm3 <- lm(points ~ attitude, data = learning2014)
lm3
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Coefficients:
## (Intercept) attitude
## 11.637 3.525
summary(lm3)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The fitted model explains 19% of the variance, and attitude is concluded to be a statistically significant predictor of the exam grade.
Next, to make sure that the model is fitted correctly, a residual analysis is due.
Residuals vs. Fitted values
plot(lm3, which = c(1))
QQ plot (theoretical quantiles)
plot(lm3, which = c(2))
Residuals vs. Leverage
plot(lm3, which = c(5))
The residuals vs. fitted values are evenly distributed and do not demonstrate any pattern. The QQ plot shows a nice fit along the line with no cause for concern. The leverage plot does not show any particularly influential outliers.
The model is valid.
Determine the number of rows in the dataset
n <- nrow(boston_scaled)
Choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
Create the train set
train <- boston_scaled[ind,]
Create the test set by subtracting the train set
test <- boston_scaled[-ind,]
Save the correct classes (the variable we want to predict) from the test data
correct_classes <- test$crime
Remove this variable from the test data
test <- dplyr::select(test, -crime)
These data describe student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features; they were collected using school reports and questionnaires.
alc <- read.csv("./data/alc.csv")
The dataset contains 35 variables and 382 observations pertaining to the performance, family, personal life and other information about students and their consumption of alcohol.
Here, I am interested in studying the relationships between high/low alcohol consumption and other variables in the data, namely, four of them: ‘absences’ (number of school absences), ‘traveltime’ (home to school travel time), ‘goout’ (going out with friends), and ‘famrel’ (quality of family relationships).
I hypothesize that a high number of absences and a lot of time spent with friends, together with bad family relationships and a short travel time, will increase the odds of high alcohol consumption in students.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
keep_columns <- c("sex", "age", "absences", "traveltime", "goout", "famrel", "high_use")
alc <- select(alc, one_of(keep_columns))
sex <- alc$sex
age <- alc$age
absences <- alc$absences
traveltime <- alc$traveltime
goout <- alc$goout
famrel <- alc$famrel
high_use <- alc$high_use
We have now reformatted the dataset to keep only the columns pertaining to the hypothesis, plus two general variables, sex and age. Now onto the graphical exploration.
library(ggplot2)
# create a dummy variable with the same mean and standard deviation, but normally distributed, for comparison purposes
absences_norm <- rnorm(200,mean=mean(absences, na.rm=TRUE), sd=sd(absences, na.rm=TRUE))
# plot the two variables side by side
boxplot(absences, absences_norm, main = "Boxplots of the absences variable: actual distribution vs. normal", at = c(1,2), names = c("absences","absences_norm"), las = 2, col = c("pink","yellow"), horizontal = TRUE)
First, we need to analyze the only truly numeric variable, the number of absences. As seen in the boxplot, the actual distribution is within normal limits (with no negative values) but has a few outliers. Before excluding those observations, we will explore the other variables and try to fit the model.
Now we will plot the interval variables with a scale of answers from 1 to 5.
boxplot(traveltime, goout, famrel, main = "Boxplots of interval variables", at = c(1,2,3), names = c("travel time","time with friends", "family relationships"), las = 2, col = c("blue","violet","green"))
The “goout” variable seems to be distributed roughly normally, while the two others are skewed in different directions. Next, we will plot each variable against the response variable and explore the dataset further.
library(GGally)
cross <- ggpairs(alc, mapping = aes(col = sex), lower = list(combo = wrap("facethist", bins = 20)))
cross
The variables have very low correlations between themselves, so there is little risk of multicollinearity tampering with the model. Now onto the model fitting.
m <- glm(high_use ~ absences + traveltime + goout + famrel, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + traveltime + goout + famrel,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0083 -0.7805 -0.5437 0.8464 2.3984
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.97770 0.68176 -4.368 1.26e-05 ***
## absences 0.07559 0.02202 3.432 0.000598 ***
## traveltime 0.43957 0.17532 2.507 0.012167 *
## goout 0.76735 0.12087 6.348 2.18e-10 ***
## famrel -0.36294 0.13756 -2.638 0.008331 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 389.46 on 377 degrees of freedom
## AIC: 399.46
##
## Number of Fisher Scoring iterations: 4
All of the explanatory variables have a statistically significant relationship with the target variable.
Next, we will calculate the odds ratios and confidence intervals to validate our model.
odds_ratios <- coef(m) %>% exp
conf_int <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(odds_ratios, conf_int)
## odds_ratios 2.5 % 97.5 %
## (Intercept) 0.0509099 0.01285711 0.1879144
## absences 1.0785172 1.03437018 1.1291483
## traveltime 1.5520466 1.10198801 2.1968917
## goout 2.1540600 1.71012923 2.7497327
## famrel 0.6956280 0.52984131 0.9103098
The confidence intervals of the odds ratios suggest that the odds of high alcohol consumption are:
1. 1.03 to 1.13 times higher for each additional absence.
2. 1.10 to 2.20 times higher for each one-step increase in travel time to school (longer, not shorter, travel time, contrary to my hypothesis).
3. 1.71 to 2.75 times higher for each one-point increase in going out with friends.
4. 0.53 to 0.91 times as high for each one-point improvement in family relationships.
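As a reminder of where these numbers come from: an odds ratio is simply the exponential of a logistic-regression coefficient. For example, for ‘goout’ (coefficient 0.76735 in the summary above):

```r
# an odds ratio is the exponentiated log-odds coefficient
b_goout <- 0.76735  # 'goout' coefficient from the model summary
exp(b_goout)        # ~2.15: each extra point of going out multiplies
                    # the odds of high alcohol use by about 2.15
```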
To continue fitting the model, we will compare predicted and actual values.
# predict the probability of high_use
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = (probability > 0.5))
select(alc, absences, traveltime, goout, famrel, high_use, probability, prediction) %>% head(10)
## absences traveltime goout famrel high_use probability prediction
## 1 5 2 4 4 FALSE 0.47428353 FALSE
## 2 3 1 3 5 FALSE 0.13895457 FALSE
## 3 8 1 2 4 TRUE 0.13581670 FALSE
## 4 1 1 2 3 FALSE 0.11746601 FALSE
## 5 2 1 2 4 FALSE 0.09079210 FALSE
## 6 8 1 2 5 FALSE 0.09855191 FALSE
## 7 0 1 4 4 FALSE 0.28486278 FALSE
## 8 4 2 4 4 FALSE 0.45548223 FALSE
## 9 0 1 2 4 FALSE 0.07906088 FALSE
## 10 0 1 1 5 FALSE 0.02697575 FALSE
prediction_plot <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
prediction_plot + geom_point()
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64921466 0.05235602 0.70157068
## TRUE 0.18586387 0.11256545 0.29842932
## Sum 0.83507853 0.16492147 1.00000000
Not totally amazing, but still. The penultimate step is to calculate the training error of the model.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the training data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2382199
Quite a good result!
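As a sanity check of the loss function on hypothetical toy inputs: with classes c(1, 0, 1) and predicted probabilities c(0.9, 0.2, 0.3), only the third prediction is off by more than 0.5, so the mean error should be 1/3.

```r
# mean prediction error: proportion of predictions off by more than 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = c(1, 0, 1), prob = c(0.9, 0.2, 0.3))  # 1/3
```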
Finally, we perform the 10-fold cross-validation.
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2329843
Sliiiiightly better than the 0.26 of the DataCamp model.
The first thing on today’s agenda is to load the Boston dataset from the MASS package in R.
# loading
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
# initial dataset exploration
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
This dataset contains 506 observations of 14 variables, two of them integer-valued and the rest numeric. The indices were measured across the following variables:
1. ‘crim’ (per capita crime rate by town)
2. ‘zn’ (proportion of residential land zoned for lots over 25,000 sq.ft)
3. ‘indus’ (proportion of non-retail business acres per town)
4. ‘chas’ (Charles River dummy variable (= 1 if tract bounds river; 0 otherwise))
5. ‘nox’ (nitrogen oxides concentration (parts per 10 million))
6. ‘rm’ (average number of rooms per dwelling)
7. ‘age’ (proportion of owner-occupied units built prior to 1940)
8. ‘dis’ (weighted mean of distances to five Boston employment centres)
9. ‘rad’ (index of accessibility to radial highways)
10. ‘tax’ (full-value property-tax rate per $10,000)
11. ‘ptratio’ (pupil-teacher ratio by town)
12. ‘black’ (1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town)
13. ‘lstat’ (lower status of the population (percent))
14. ‘medv’ (median value of owner-occupied homes in $1000s)
The variables have very different ranges, which means the data should probably be standardised before the analysis. Now onto a graphical overview.
pairs(Boston)
A lot of variables seem to be strongly correlated. The plot of ‘nox’ against ‘dis’, for instance, suggests a relationship of the form 1/x. This makes sense, since air pollution dissipates the farther you are from the city centre.
To check whether this is just an impression or a valid hypothesis, let’s look at the correlations.
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
# round up to 2 digits
cor_matrix_r <- cor_matrix
round(cor_matrix_r, 2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor_matrix_r, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The hypothesis is supported: the correlation coefficient between ‘nox’ and ‘dis’ is -0.77.
For classification and clustering purposes, the next step is to scale the dataset to avoid skewing the results.
# scaling around 0
boston_scaled <- scale(Boston)
# getting a matrix as a result
class(boston_scaled)
## [1] "matrix"
# transforming the matrix into a new dataset
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Now the variables are on the same scale with a mean of 0. In order to do classification, we need a binary or multiclass target variable. Since we are most interested in classifying neighbourhoods according to crime rates, we will next transform the ‘crim’ variable into a categorical one via quantiles.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low","med_low","med_high","high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
We are done with the data preparation and can now separate the test and training sets for model training and validation.
# determine the number of rows in the dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create a train set
train <- boston_scaled[ind,]
# create a test set by subtracting the train set
test <- boston_scaled[-ind,]
All the steps in preparing the data are complete.
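One caveat: sample() draws a different 80% on every run, so the counts below will vary between knits; setting a seed first would make the split reproducible. A sketch (the seed value 42 is arbitrary):

```r
set.seed(42)                      # fix the RNG state so the split is repeatable
n <- 506                          # nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)  # indices of the training rows
length(ind)                       # 404 rows for train, leaving 102 for test
```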
Fitting the Linear Discriminant Analysis to the train data:
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2500000 0.2500000 0.2475248
##
## Group means:
## zn indus chas nox rm
## low 1.0283825 -0.9254675 -0.1179329800 -0.8870913 0.5154579
## med_low -0.0890347 -0.2637405 0.0005392655 -0.5737786 -0.1221328
## med_high -0.3836558 0.1670026 0.0785016464 0.2956172 0.1826823
## high -0.4872402 1.0171519 -0.0361030535 1.0417446 -0.4651386
## age dis rad tax ptratio
## low -0.8712921 0.9065474 -0.6868728 -0.7589060 -0.48212247
## med_low -0.3704369 0.3522783 -0.5543231 -0.4677380 -0.03479791
## med_high 0.4001481 -0.3404595 -0.4030893 -0.3139393 -0.22550538
## high 0.7915808 -0.8456922 1.6377820 1.5138081 0.78037363
## black lstat medv
## low 0.3783065 -0.78310219 0.59028998
## med_low 0.3095684 -0.16135853 0.02323866
## med_high 0.1198860 -0.03287276 0.21830651
## high -0.7105463 0.89393387 -0.70226522
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.088637397 0.7705277789 -0.87559252
## indus 0.003806011 -0.2953748054 0.19938155
## chas -0.052995147 -0.0379711812 0.12389336
## nox 0.434738952 -0.6128463135 -1.34931264
## rm -0.110224811 -0.1074986266 -0.21470470
## age 0.228428598 -0.2788587976 -0.31931173
## dis -0.049867309 -0.2294623006 0.06557595
## rad 3.192061371 0.9594481947 -0.16939902
## tax -0.032933776 -0.0601589090 0.73622549
## ptratio 0.125103415 0.0154585653 -0.21589998
## black -0.100724641 0.0001947518 0.10132264
## lstat 0.240022264 -0.2013371225 0.30842566
## medv 0.170982945 -0.3182322020 -0.19622623
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9500 0.0365 0.0135
The group means show that the three variables most strongly associated with high crime rates are the index of accessibility to radial highways (1.64), the full-value property-tax rate per $10,000 (1.51) and, interestingly, the pollution level (1.04).
The coefficients of the linear discriminants confirm this for accessibility to radial highways: its LD1 coefficient (3.19) is several times larger than that of any other variable in the dataset.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "black", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 3)
The biplot confirms that accessibility to radial highways has the biggest influence.
Next, we will test the model that we have fitted on the test set, which we have previously separated from the data.
# save the correct classes from test data
correct_classes <- test$crime
# remove this variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 12 1 0
## med_low 4 16 5 0
## med_high 0 4 20 1
## high 0 0 0 27
The model looks quite adequate; there is just a small number of mismatches in the classification.
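The overall accuracy can be read off the cross-tabulation as the share of the diagonal; assuming the table shown above (the exact counts vary with the random split):

```r
# correct predictions sit on the diagonal of the confusion matrix
tab <- matrix(c(12, 12,  1,  0,
                 4, 16,  5,  0,
                 0,  4, 20,  1,
                 0,  0,  0, 27), nrow = 4, byrow = TRUE)
sum(diag(tab)) / sum(tab)  # (12 + 16 + 20 + 27) / 102, about 0.735
```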
After assigning the observations to previously determined classes, we now turn to unsupervised methods. K-means clustering begins, as usual, with loading and scaling the data.
# loading
library(MASS)
data("Boston")
# scaling around 0
boston_scaled <- scale(Boston)
# getting a matrix as a result
class(boston_scaled)
## [1] "matrix"
# transforming the matrix into a new dataset
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
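As a sanity check on what scale() is doing here, each column is centred to its mean and divided by its standard deviation; a minimal sketch on a toy vector:

```r
# scale() standardises each column as (x - mean(x)) / sd(x)
x <- c(1, 5, 9, 13)
manual <- (x - mean(x)) / sd(x)
stopifnot(isTRUE(all.equal(as.numeric(scale(x)), manual)))
# after scaling, the mean is 0 and the standard deviation is 1
round(mean(manual), 10)  # 0
sd(manual)               # 1
```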
After the standardisation is complete, we need to calculate the distances between observations in the scaled data.
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
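dist() defaults to the Euclidean method, so each entry in the summary above is the square root of the summed squared coordinate differences between two rows; a minimal sketch on two hypothetical points:

```r
# Euclidean distance between two points, by hand vs dist()
m <- rbind(a = c(0, 0), b = c(3, 4))
d_manual <- sqrt(sum((m["a", ] - m["b", ])^2))
d_dist   <- as.numeric(dist(m))
stopifnot(isTRUE(all.equal(d_manual, d_dist)))
d_manual  # 5
```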
Next, we will try out an arbitrary number of clusters.
# k-means clustering
km <- kmeans(boston_scaled, centers = 10)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
The plot looks very colourful, but it is obvious that the number of clusters is too big. Let's cut it in half.
# k-means clustering
km <- kmeans(boston_scaled, centers = 5)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
Looks better, but still confusing. It is time to turn to more principled, mathematical methods of identifying the right number of clusters for this dataset.
set.seed(123)
# determine the maximum number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')
According to the total within-cluster sum of squares plot, the most pronounced "elbow" sits at k = 2. Therefore, we will run the k-means analysis with 2 centroids.
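The logic behind the elbow heuristic is that the total within-cluster sum of squares can only shrink as k grows, so we look for the k after which the decrease flattens out. A sketch on a deliberately two-cluster toy set (hypothetical data, not Boston):

```r
set.seed(123)
# two well-separated blobs of 50 points each in 2 dimensions
toy <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))
# total within-cluster sum of squares for k = 1..4
twcss_toy <- sapply(1:4, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
# the k = 1 -> 2 drop dwarfs the later ones: that is the elbow
diff(twcss_toy)
```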
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
It turns out that the most fitting number of clusters for this dataset was just 2. A bit underwhelming, though.
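To see what the two clusters actually separate, we can inspect the centroids of the fitted model on the scaled data; a sketch reproducing the fit above (the cluster labels themselves depend on the random seed):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)
# one row of standardised means per cluster; large absolute values
# indicate which variables drive the split
round(km$centers, 2)
km$size  # observations per cluster; the sizes sum to nrow(Boston) = 506
```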
human <- read.csv("./data/human.csv", row.names = 1)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ sed_ratio : num 0.993 1.003 1.017 1.012 1.032 ...
## $ labor_ratio: num 0.891 0.819 0.825 0.884 0.829 ...
## $ ed_exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ lifeexp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ gni : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ matmort : int 4 6 6 5 6 7 9 28 11 8 ...
## $ teenbirths : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ parl : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
## sed_ratio labor_ratio ed_exp lifeexp
## Min. :0.6681 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:1.0032 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :1.0667 Median :0.7535 Median :13.50 Median :74.20
## Mean :1.3523 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:1.3766 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :5.8235 Max. :1.0380 Max. :20.20 Max. :83.50
## gni matmort teenbirths parl
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
library(GGally)
p_pairs <- ggpairs(human)
p_pairs
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ tibble 2.1.3 ✔ purrr 0.3.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ MASS::select() masks dplyr::select()
library(corrplot)
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot()
All of the variables differ dramatically in their ranges. Two of them, GNI per capita and the maternal mortality ratio, are strongly skewed to the right; the others seem to be distributed more or less normally.
This dataframe contains 155 observations and 8 variables, six of them continuous and two integer-valued. The correlation plot shows that it's a highly intercorrelated dataset, which is well suited to Principal Component Analysis.
Some of the variables in the data are strongly positively or negatively correlated: for instance, maternal mortality, quite logically, has a strong positive correlation with the adolescent birth rate. On the other hand, maternal mortality is strongly negatively correlated with expected years of education.
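As a quick reminder of what those correlation signs encode, here is cor() on a hypothetical toy pair:

```r
set.seed(1)
x <- 1:20
y_pos <-  2 * x + rnorm(20, sd = 0.5)  # rises with x -> correlation near +1
y_neg <- -3 * x + rnorm(20, sd = 0.5)  # falls as x rises -> correlation near -1
c(positive = cor(x, y_pos), negative = cor(x, y_neg))
```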
First, let’s try Principal Component Analysis on non-scaled data.
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855223e+02 2.518704e+01 1.145442e+01 3.766254e+00
## [6] 1.569207e+00 5.345221e-01 1.720831e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## sed_ratio 1.438861e-05 -0.0022503438 0.0016047293 -1.006305e-03
## labor_ratio 2.331945e-07 -0.0002819336 -0.0005303524 4.692746e-03
## ed_exp -9.562910e-05 0.0075529685 -0.0142769848 3.313605e-02
## lifeexp -2.815823e-04 0.0283149570 -0.0129494322 6.752623e-02
## gni -9.999832e-01 -0.0057723207 0.0005156755 -4.932869e-05
## matmort 5.655734e-03 -0.9916297159 -0.1260337001 6.102337e-03
## teenbirths 1.233961e-03 -0.1255500502 0.9918095255 -5.299293e-03
## parl -5.526460e-05 0.0032317269 0.0073979127 9.971228e-01
## PC5 PC6 PC7 PC8
## sed_ratio 0.0034964012 -7.260101e-02 -9.945465e-01 -7.473545e-02
## labor_ratio 0.0022140041 3.308785e-02 7.249269e-02 -9.968063e-01
## ed_exp 0.1430816370 9.858861e-01 -7.363222e-02 2.784972e-02
## lifeexp 0.9865668049 -1.447742e-01 1.397970e-02 -1.280901e-03
## gni -0.0001135872 -2.707883e-05 1.066685e-06 -1.814719e-07
## matmort 0.0266316436 1.829585e-03 1.946440e-03 6.376929e-04
## teenbirths 0.0188547462 1.284169e-02 1.016751e-03 2.495532e-05
## parl -0.0716358797 -2.313031e-02 1.488159e-04 3.773314e-03
# rounded percentages of variance captured by each PC
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2,], digits = 1)
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "yellow"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA_non-scaled")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## (this warning appears five times in the output, once per near-zero-length arrow)
We can immediately see from the model summary and the plot that the first component captures essentially 100% of the variance. This is due to the differences in the ranges of the variables: GNI per capita, represented by the longest axis, clearly has the biggest standard deviation. All the arrows sit on the same axis, as if the variables were fully correlated.
Therefore, this analysis cannot really be relied on.
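Instead of calling scale() by hand, prcomp() can standardise internally through its scale. argument; the two routes give the same fit. A sketch on the built-in USArrests data (used here so the example is self-contained):

```r
# prcomp(x, scale. = TRUE) is equivalent to prcomp(scale(x))
p1 <- prcomp(USArrests, scale. = TRUE)
p2 <- prcomp(scale(USArrests))
# the component standard deviations match (signs of loadings may flip)
stopifnot(isTRUE(all.equal(p1$sdev, p2$sdev)))
round(p1$sdev, 3)
```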
# standardize the variables
human_std <- scale(human)
summary(human_std)
## sed_ratio labor_ratio ed_exp lifeexp
## Min. :-0.92829 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.47369 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median :-0.38753 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.03306 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 6.06675 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## gni matmort teenbirths parl
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 2.0686801 1.1395786 0.8731156 0.8198145 0.6157250 0.5399602 0.4589378
## [8] 0.3258813
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## sed_ratio 0.35428486 -0.03349747 0.26152597 -0.62161810 0.6060082
## labor_ratio 0.05429201 0.72405070 -0.60003871 0.01397768 0.2272492
## ed_exp -0.42790065 0.14040242 -0.07383829 -0.09101094 0.1679844
## lifeexp -0.44475461 -0.02380824 0.10661368 -0.05928025 0.1612152
## gni -0.34485721 0.04901234 -0.12633930 -0.73711731 -0.5106521
## matmort 0.43894896 0.14353662 -0.10075255 -0.22918529 -0.1688594
## teenbirths 0.41461440 0.07478342 0.04257017 0.01639864 -0.4720253
## parl -0.08438722 0.65249620 0.72581969 0.07390325 -0.1217201
## PC6 PC7 PC8
## sed_ratio -0.15511613 -0.063047732 -0.15255603
## labor_ratio -0.04403599 -0.230781162 -0.07562945
## ed_exp -0.37880046 0.780514915 -0.05040850
## lifeexp -0.41364089 -0.430761634 0.63568952
## gni 0.14453049 -0.125313539 -0.14835638
## matmort 0.20112665 0.359359700 0.72521669
## teenbirths -0.76068241 -0.055924672 -0.12588884
## parl 0.13940058 0.005956103 -0.02382852
# rounded percentages of variance captured by each PC
s_st <- summary(pca_human_std)
pca_pr_st <- round(100*s_st$importance[2,], digits = 1)
# create object pc_lab to be used as axis labels
pc_lab_st <- paste0(names(pca_pr_st), " (", pca_pr_st, "%)")
# draw a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "green"), xlab = pc_lab_st[1], ylab = pc_lab_st[2], main = "PCA_scaled")
The current analysis looks much more reliable. The countries are distributed throughout the two-dimensional space, and the first two principal components together explain 69.7% of the variance, a respectable share for two components.
Speculating about what to call the two dimensions, I would suggest "welfare" for PC1, because it collects indicators of health, social protection, and economic growth, and "participation" for PC2, since it deals with workforce and political participation. The variables contributing most to the former are life expectancy at birth, GNI per capita, the maternal mortality ratio, the adolescent birth rate, expected years of education, and the ratio of women to men with secondary education. For the latter, the two main factors are representation in parliament and the labour force participation ratio. Judging by the angles between the arrows along the two axes, the variables within each group are closely correlated with each other. The lengths of the arrows also became much more equal after standardisation.
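The percentages shown on the biplot axes come directly from the component standard deviations: each PC's share of variance is its sdev squared over the total. A self-contained check on the built-in USArrests data (the standardised human data is not rebuilt here):

```r
pca <- prcomp(USArrests, scale. = TRUE)
prop <- pca$sdev^2 / sum(pca$sdev^2)
# matches the "Proportion of Variance" row of summary(pca) (which rounds to 5 digits)
stopifnot(all(abs(prop - unname(summary(pca)$importance[2, ])) < 1e-5))
round(100 * prop, 1)
```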
library(FactoMineR)
data(tea)
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
# gather() comes from tidyr (loaded above with the tidyverse)
library(ggplot2)
gather(tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
The “tea” dataset contains 300 observations and 36 variables pertaining to people’s tea-drinking habits, most of which are two-level factors. Some are fairly evenly split between their levels, others are heavily skewed.
The current dataframe is too big for a meaningful MCA to my taste, so I chose nine variables of interest.
# creating a subset
keep_columns <- c("Tea", "How", "how", "sugar", "home", "breakfast", "age", "sex", "frequency")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
## home breakfast age sex
## home :291 breakfast :144 Min. :15.00 F:178
## Not.home: 9 Not.breakfast:156 1st Qu.:23.00 M:122
## Median :32.00
## Mean :37.05
## 3rd Qu.:48.00
## Max. :90.00
## frequency
## 1/day : 95
## 1 to 2/week: 44
## +2/day :127
## 3 to 6/week: 34
##
##
str(tea_time)
## 'data.frame': 300 obs. of 9 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ breakfast: Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ frequency: Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
library(ggplot2)
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
A lot of young people and Earl Grey lovers who drink more than two cups a day (just like me).
Prior to performing MCA, we need to transform the variables into factors, because MCA only operates on categorical variables.
# prepare the data for MCA by converting all the variables to factors
colnames(tea_time)
## [1] "Tea" "How" "how" "sugar" "home" "breakfast"
## [7] "age" "sex" "frequency"
# convert every column (including the integer age) to a factor in one step
tea_time[] <- lapply(tea_time, factor)
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.234 0.205 0.191 0.190 0.178 0.170
## % of var. 2.843 2.491 2.324 2.309 2.164 2.067
## Cumulative % of var. 2.843 5.334 7.658 9.968 12.132 14.199
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.162 0.157 0.153 0.150 0.147 0.139
## % of var. 1.973 1.914 1.866 1.825 1.793 1.687
## Cumulative % of var. 16.172 18.086 19.952 21.776 23.569 25.257
## Dim.13 Dim.14 Dim.15 Dim.16 Dim.17 Dim.18
## Variance 0.130 0.125 0.111 0.111 0.111 0.111
## % of var. 1.575 1.522 1.351 1.351 1.351 1.351
## Cumulative % of var. 26.832 28.354 29.705 31.056 32.408 33.759
## Dim.19 Dim.20 Dim.21 Dim.22 Dim.23 Dim.24
## Variance 0.111 0.111 0.111 0.111 0.111 0.111
## % of var. 1.351 1.351 1.351 1.351 1.351 1.351
## Cumulative % of var. 35.110 36.462 37.813 39.164 40.516 41.867
## Dim.25 Dim.26 Dim.27 Dim.28 Dim.29 Dim.30
## Variance 0.111 0.111 0.111 0.111 0.111 0.111
## % of var. 1.351 1.351 1.351 1.351 1.351 1.351
## Cumulative % of var. 43.218 44.570 45.921 47.272 48.624 49.975
## Dim.31 Dim.32 Dim.33 Dim.34 Dim.35 Dim.36
## Variance 0.111 0.111 0.111 0.111 0.111 0.111
## % of var. 1.351 1.351 1.351 1.351 1.351 1.351
## Cumulative % of var. 51.327 52.678 54.029 55.381 56.732 58.083
## Dim.37 Dim.38 Dim.39 Dim.40 Dim.41 Dim.42
## Variance 0.111 0.111 0.111 0.111 0.111 0.111
## % of var. 1.351 1.351 1.351 1.351 1.351 1.351
## Cumulative % of var. 59.435 60.786 62.137 63.489 64.840 66.191
## Dim.43 Dim.44 Dim.45 Dim.46 Dim.47 Dim.48
## Variance 0.111 0.111 0.111 0.111 0.111 0.111
## % of var. 1.351 1.351 1.351 1.351 1.351 1.351
## Cumulative % of var. 67.543 68.894 70.245 71.597 72.948 74.300
## Dim.49 Dim.50 Dim.51 Dim.52 Dim.53 Dim.54
## Variance 0.111 0.111 0.111 0.111 0.111 0.111
## % of var. 1.351 1.351 1.351 1.351 1.351 1.351
## Cumulative % of var. 75.651 77.002 78.354 79.705 81.056 82.408
## Dim.55 Dim.56 Dim.57 Dim.58 Dim.59 Dim.60
## Variance 0.111 0.111 0.111 0.111 0.111 0.111
## % of var. 1.351 1.351 1.351 1.351 1.351 1.351
## Cumulative % of var. 83.759 85.110 86.462 87.813 89.164 90.516
## Dim.61 Dim.62 Dim.63 Dim.64 Dim.65 Dim.66
## Variance 0.082 0.076 0.072 0.069 0.066 0.061
## % of var. 0.997 0.929 0.877 0.842 0.802 0.738
## Cumulative % of var. 91.513 92.442 93.319 94.161 94.963 95.700
## Dim.67 Dim.68 Dim.69 Dim.70 Dim.71 Dim.72
## Variance 0.058 0.055 0.051 0.046 0.043 0.042
## % of var. 0.711 0.665 0.619 0.554 0.521 0.507
## Cumulative % of var. 96.411 97.076 97.696 98.249 98.770 99.277
## Dim.73 Dim.74
## Variance 0.032 0.027
## % of var. 0.391 0.332
## Cumulative % of var. 99.668 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.026 0.001 0.000 | -0.762 0.946 0.033 | 0.936
## 2 | 0.573 0.469 0.076 | -0.306 0.152 0.022 | 0.581
## 3 | 0.357 0.182 0.021 | 0.330 0.178 0.018 | -0.328
## 4 | -0.480 0.328 0.066 | -0.396 0.256 0.045 | -0.068
## 5 | -0.013 0.000 0.000 | 0.083 0.011 0.001 | 0.398
## 6 | -0.233 0.077 0.023 | -0.178 0.052 0.013 | -0.071
## 7 | -0.035 0.002 0.000 | 0.257 0.107 0.012 | -0.021
## 8 | 0.007 0.000 0.000 | 0.446 0.324 0.031 | 0.183
## 9 | 0.412 0.243 0.029 | -0.329 0.176 0.018 | 0.270
## 10 | 0.550 0.432 0.058 | 0.313 0.159 0.019 | 0.239
## ctr cos2
## 1 1.529 0.050 |
## 2 0.589 0.078 |
## 3 0.188 0.018 |
## 4 0.008 0.001 |
## 5 0.276 0.026 |
## 6 0.009 0.002 |
## 7 0.001 0.000 |
## 8 0.058 0.005 |
## 9 0.127 0.012 |
## 10 0.100 0.011 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.750 6.586 0.184 7.416 | 0.649 5.630
## Earl Grey | -0.217 1.437 0.085 -5.034 | -0.323 3.632
## green | -0.413 0.891 0.021 -2.510 | 0.432 1.115
## alone | -0.154 0.731 0.044 -3.625 | 0.231 1.886
## lemon | -0.264 0.363 0.009 -1.602 | -0.186 0.207
## milk | 0.306 0.932 0.025 2.725 | -0.838 8.008
## other | 2.160 6.653 0.144 6.569 | 1.541 3.863
## tea bag | -0.158 0.675 0.033 -3.129 | -0.221 1.499
## tea bag+unpackaged | 0.357 1.893 0.058 4.164 | -0.041 0.029
## unpackaged | -0.184 0.192 0.005 -1.172 | 1.151 8.626
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.138 6.418 | 0.671 6.463 0.148 6.642 |
## Earl Grey 0.188 -7.492 | -0.431 6.944 0.335 -10.006 |
## green 0.023 2.628 | 1.014 6.582 0.127 6.167 |
## alone 0.099 5.449 | -0.193 1.413 0.069 -4.556 |
## lemon 0.004 -1.131 | 0.192 0.237 0.005 1.169 |
## milk 0.187 -7.474 | 0.559 3.817 0.083 4.985 |
## other 0.073 4.685 | -0.430 0.322 0.006 -1.307 |
## tea bag 0.064 -4.367 | 0.030 0.030 0.001 0.595 |
## tea bag+unpackaged 0.001 -0.484 | -0.377 2.588 0.065 -4.403 |
## unpackaged 0.181 7.350 | 0.842 4.948 0.097 5.377 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.188 0.191 0.344 |
## How | 0.183 0.257 0.100 |
## how | 0.058 0.187 0.130 |
## sugar | 0.199 0.068 0.003 |
## home | 0.172 0.010 0.007 |
## breakfast | 0.271 0.174 0.007 |
## age | 0.557 0.629 0.695 |
## sex | 0.067 0.009 0.353 |
## frequency | 0.409 0.317 0.081 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
The MCA resulted in quite a busy graph with a rather low level of variance coverage: just 2.84% for the first dimension and 2.49% for the second. Converting the raw age variable into a factor was not a brilliant idea either; I should have split it into meaningful bins according to age groups (the ready-made age_Q variable would have been a better choice).
However, we can see some clusters: people who prefer Earl Grey tend to take lemon and sugar with their drink once a day. Black-tea lovers drink it sugarless, twice a day or more. Finally, infrequent tea-drinkers (one to six times a week) lean towards drinking it not at breakfast, not at home, and without any of the fancy stuff.